Abstract

We present a simple randomized procedure for the prediction of a binary sequence. 
The algorithm uses ideas from recent developments of the theory of the prediction of 
individual sequences. We show that if the sequence is a realization of a stationary and 
ergodic random process then the average number of mistakes converges, almost surely, 
to that of the optimum, given by the Bayes predictor. The desirable finite-sample 
properties of the predictor are illustrated by its performance for Markov processes. In 
such cases the predictor exhibits near optimal behavior even without knowing the order 
of the Markov process. Prediction with side information is also considered. 







1 Introduction 



We address the problem of sequential prediction of a binary sequence. A sequence of bits
$y_1, y_2, \ldots \in \{0,1\}$ is hidden from the predictor. At each time instant
$i = 1, 2, \ldots$, the predictor is asked to guess the value of the next outcome $y_i$ with
knowledge of the past $y_1^{i-1} = (y_1, \ldots, y_{i-1})$ (where $y_1^0$ denotes the empty
string). Thus, the predictor's decision, at time $i$, is based on the value of $y_1^{i-1}$.
We also assume that the predictor has access to a sequence of independent, identically
distributed (i.i.d.) random variables $U_1, U_2, \ldots$, uniformly distributed on $[0,1]$,
so that the predictor can use $U_i$ in forming a randomized decision for $y_i$. Formally,
the strategy of the predictor is a sequence $g = \{g_i\}_{i=1}^{\infty}$ of decision
functions

$$g_i : \{0,1\}^{i-1} \times [0,1] \to \{0,1\}$$

and the randomized prediction formed at time $i$ is $g_i(y_1^{i-1}, U_i)$. The predictor pays
a unit penalty each time a mistake is made. After $n$ rounds of play, the normalized
cumulative loss on the string $y_1^n$ is

$$L_1^n(g, U_1^n) = \frac{1}{n} \sum_{i=1}^{n} I_{\{g_i(y_1^{i-1}, U_i) \neq y_i\}},$$

where $I$ denotes the indicator function. When no confusion is caused, or when the predictor
does not randomize, we will simply write $L_1^n(g) = L_1^n(g, U_1^n)$. In general, we denote
the average number of mistakes between times $m$ and $n$ by

$$L_m^n(g, U_m^n) = \frac{1}{n-m+1} \sum_{i=m}^{n} I_{\{g_i(y_1^{i-1}, U_i) \neq y_i\}}.$$

We also write

$$\widehat{L}_1^n(g) = \mathbf{E} L_1^n(g, U_1^n) \quad \text{and} \quad \widehat{L}_m^n(g) = \mathbf{E} L_m^n(g, U_m^n)$$

for the expected loss of the randomized strategy $g$. (Here the expectation is taken with
respect to the randomization $U_1^n$.)
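
As a concrete reading of these definitions, the following minimal sketch computes the
average number of mistakes between times $m$ and $n$ for a given list of predictions. The
function name `normalized_loss` and the list-based, 0-indexed storage of the 1-indexed
sequences are assumptions made for illustration only.

```python
from typing import Optional, Sequence


def normalized_loss(preds: Sequence[int], bits: Sequence[int],
                    m: int = 1, n: Optional[int] = None) -> float:
    """Average number of mistakes between times m and n (1-based, inclusive).

    preds[i-1] plays the role of g_i(y_1^{i-1}, U_i) and bits[i-1] of y_i.
    The name and signature are hypothetical, chosen for illustration.
    """
    if n is None:
        n = len(bits)
    if not 1 <= m <= n <= len(bits):
        raise ValueError("need 1 <= m <= n <= len(bits)")
    # Count the indicator I{g_i != y_i} over i = m, ..., n.
    mistakes = sum(preds[i - 1] != bits[i - 1] for i in range(m, n + 1))
    return mistakes / (n - m + 1)
```

With `m = 1` and `n = len(bits)` this is the loss over the whole string; other values of
`m` and `n` give the loss on a segment, which is the quantity used later on blocks between
consecutive powers of two.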

In this paper we assume that $y_1, y_2, \ldots$ are realizations of the random variables
$Y_1, Y_2, \ldots$ drawn from the binary-valued stationary and ergodic process
$\{Y_n\}_{n=-\infty}^{\infty}$. We assume that the randomizing variables $U_1, U_2, \ldots$
are independent of the process $\{Y_n\}_{n=-\infty}^{\infty}$.

In this case there is a fundamental limit for the predictability of the sequence. This is 
stated in the next theorem whose proof may be found in Algoet [2]. 

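Theorem 1 (Algoet [2]) For any prediction strategy $g$ and stationary ergodic process
$\{Y_n\}_{n=-\infty}^{\infty}$,

$$\liminf_{n \to \infty} L_1^n(g) \ge L^* \quad \text{almost surely,}$$

where

$$L^* = \mathbf{E}\left[ \min\left( \mathbf{P}\{Y_0 = 1 \mid Y_{-\infty}^{-1}\},\ \mathbf{P}\{Y_0 = 0 \mid Y_{-\infty}^{-1}\} \right) \right]$$

is the minimal (Bayes) probability of error of any decision for the value of $Y_0$ based on the
infinite past $Y_{-\infty}^{-1} = (\ldots, Y_{-3}, Y_{-2}, Y_{-1})$.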
Based on Theorem 1, the following definition is meaningful:

Definition 1 A prediction strategy $g$ is called universal if for all stationary and ergodic
processes $\{Y_n\}_{n=-\infty}^{\infty}$,

$$\lim_{n \to \infty} L_1^n(g) = L^* \quad \text{almost surely.}$$

Therefore, universal strategies asymptotically achieve the best possible loss for all ergodic
processes. The first question is, of course, whether such a strategy exists. The affirmative
answer follows from a more general result of Algoet [2]. Here we give an alternative proof
which is based on earlier results of Ornstein and Bailey.

Theorem 2 (Algoet [2]) There exists a universal prediction scheme. 



Proof. Ornstein [19] proved that there exists a sequence of functions $f_i : \{0,1\}^i \to [0,1]$,
$i = 1, 2, \ldots$ such that for all ergodic processes $\{Y_n\}_{n=-\infty}^{\infty}$,

$$\lim_{n \to \infty} f_n(Y_{-n}^{-1}) = \mathbf{P}\{Y_0 = 1 \mid Y_{-\infty}^{-1}\} \quad \text{almost surely.} \tag{1}$$

Bailey [4] observed that for such estimators, for all ergodic processes

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left| f_{i-1}(Y_1^{i-1}) - \mathbf{P}\{Y_i = 1 \mid Y_{-\infty}^{i-1}\} \right| = 0 \quad \text{almost surely.} \tag{2}$$

Indeed, (1) and Breiman's generalized ergodic theorem (see Lemma 4 in the Appendix) yield
(2).

Once such a sequence $\{f_i\}$ of estimators is available, we may define a (non-randomized)
prediction scheme by the plug-in predictor

$$g_n(y_1^{n-1}) = \begin{cases} 1 & \text{if } f_{n-1}(y_1^{n-1}) > \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$

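It is well known that the probability of error of such a plug-in predictor may be bounded by
the $L_1$ error of the estimator it is based on. In particular, by a simple inequality appearing
in the proof of [9, Theorem 2.2],

$$\mathbf{P}\{g_n(Y_1^{n-1}) \neq Y_n \mid Y_{-\infty}^{n-1}\} - \mathbf{P}\{g_n^*(Y_{-\infty}^{n-1}) \neq Y_n \mid Y_{-\infty}^{n-1}\}
\le 2 \left| f_{n-1}(Y_1^{n-1}) - \mathbf{P}\{Y_n = 1 \mid Y_{-\infty}^{n-1}\} \right| .$$

Therefore,

$$\begin{aligned}
\left| L_1^n(g) - L^* \right|
&\le \left| L_1^n(g) - \frac{1}{n} \sum_{i=1}^{n} \mathbf{P}\{g_i(Y_1^{i-1}) \neq Y_i \mid Y_{-\infty}^{i-1}\} \right| \\
&\quad + \frac{1}{n} \sum_{i=1}^{n} \left| \mathbf{P}\{g_i(Y_1^{i-1}) \neq Y_i \mid Y_{-\infty}^{i-1}\} - \mathbf{P}\{g_i^*(Y_{-\infty}^{i-1}) \neq Y_i \mid Y_{-\infty}^{i-1}\} \right| \\
&\quad + \left| \frac{1}{n} \sum_{i=1}^{n} \mathbf{P}\{g_i^*(Y_{-\infty}^{i-1}) \neq Y_i \mid Y_{-\infty}^{i-1}\} - L^* \right| \\
&\le \left| L_1^n(g) - \frac{1}{n} \sum_{i=1}^{n} \mathbf{P}\{g_i(Y_1^{i-1}) \neq Y_i \mid Y_{-\infty}^{i-1}\} \right| \\
&\quad + \frac{2}{n} \sum_{i=1}^{n} \left| f_{i-1}(Y_1^{i-1}) - \mathbf{P}\{Y_i = 1 \mid Y_{-\infty}^{i-1}\} \right| \\
&\quad + \left| \frac{1}{n} \sum_{i=1}^{n} \mathbf{P}\{g_i^*(Y_{-\infty}^{i-1}) \neq Y_i \mid Y_{-\infty}^{i-1}\} - L^* \right| .
\end{aligned}$$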

The first term of the right-hand side tends to zero almost surely by the Hoeffding-Azuma 
inequality (Lemma 5 in the Appendix) and the Borel-Cantelli lemma. The second one 
converges to zero almost surely by (2) and the third term tends to zero almost surely by the 
ergodic theorem. □ 

It was Ornstein [19] who first proved the existence of estimators satisfying (1). This was
later generalized by Algoet [1]. A simpler estimator with the same convergence property was
introduced by Morvai, Yakowitz, and Gyorfi [17]. Unfortunately, even the simpler estimator
needs such large amounts of data that its practical use is unrealistic. By this we mean that
even for "simple" i.i.d. or Markov processes the rate of convergence of the estimator is very
slow. Motivated by the need for a practical estimator, Morvai, Yakowitz, and Algoet [18]
introduced an even simpler algorithm. However, it is not known whether their estimator
satisfies (1), and we do not even know whether the corresponding predictor is universal. The
purpose of this paper is to introduce a new simple universal predictor whose finite-sample
performance for Markov processes promises practical applicability.
2 A simple universal algorithm



In this section we present a simple prediction strategy, and prove its universality. It is 
motivated by some recent developments from the theory of the prediction of individual 
sequences (see, e.g., Vovk [22], Feder, Merhav, and Gutman [10], Littlestone and Warmuth 
[12], Cesa-Bianchi et al. [7]). These methods predict according to a combination of several 
predictors, the so-called experts. 

The main idea in this paper is that if the sequence to predict is drawn from a stationary 
and ergodic process, combining the predictions of a small and simple set of appropriately 
chosen predictors (the so-called experts) suffices to achieve universality. 

First we define an infinite sequence of experts $h^{(1)}, h^{(2)}, \ldots$ as follows: Fix a positive
integer $k$, and for each $n \ge 1$, $s \in \{0,1\}^k$ and $y \in \{0,1\}$ define the function
$\widehat{P}_n^k : \{0,1\} \times \{0,1\}^{n-1} \times \{0,1\}^k \to [0,1]$ by

$$\widehat{P}_n^k(y, y_1^{n-1}, s) = \frac{\left| \{k < i < n : y_{i-k}^{i-1} = s,\ y_i = y\} \right|}{\left| \{k < i < n : y_{i-k}^{i-1} = s\} \right|}, \qquad n > k+1, \tag{3}$$

where $0/0$ is defined to be $1/2$. Also, for $n \le k+1$ we define $\widehat{P}_n^k(y, y_1^{n-1}, s) = 1/2$. In other
words, $\widehat{P}_n^k(y, y_1^{n-1}, s)$ is the proportion of the appearances of the bit $y$ following the string $s$
among all appearances of $s$ in the sequence $y_1^{n-1}$.

The expert $h^{(k)}$ is a sequence of functions $h_n^{(k)} : \{0,1\}^{n-1} \to \{0,1\}$, $n = 1, 2, \ldots$ defined by

$$h_n^{(k)}(y_1^{n-1}) = \begin{cases} 0 & \text{if } \widehat{P}_n^k(0, y_1^{n-1}, y_{n-k}^{n-1}) > \frac{1}{2} \\ 1 & \text{otherwise.} \end{cases}$$

That is, expert $h^{(k)}$ is a (nonrandomized) prediction strategy, which looks for all appearances
of the last seen string $y_{n-k}^{n-1}$ of length $k$ in the past and predicts according to the larger of
the relative frequencies of 0's and 1's following the string. We may call $h^{(k)}$ a $k$-th order
empirical Markov strategy.
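
The expert just defined can be sketched in a few lines. The following is a minimal
illustration, not the authors' implementation: the function name
`empirical_markov_predict` and the 0-indexed list that stores $y_1^{n-1}$ are assumptions
made here. Ties, including the case with no earlier occurrence of the context, are resolved
in favor of 1, in line with the rule "predict 0 only if the estimate for 0 exceeds 1/2".

```python
from typing import Sequence


def empirical_markov_predict(past: Sequence[int], k: int) -> int:
    """Prediction of the k-th order empirical Markov expert for the next bit.

    `past` holds y_1, ..., y_{n-1} (past[j] is y_{j+1}); k is a positive integer.
    """
    n = len(past) + 1                 # index of the bit being predicted
    if n <= k + 1:
        return 1                      # estimate is 1/2, so the rule yields 1
    context = list(past[-k:])         # last seen string y_{n-k}^{n-1}
    zeros = ones = 0
    for i in range(k + 1, n):         # 1-based times i with k < i < n
        # y_{i-k}^{i-1} occupies past[i-1-k : i-1]; y_i is past[i-1].
        if list(past[i - 1 - k:i - 1]) == context:
            if past[i - 1] == 0:
                zeros += 1
            else:
                ones += 1
    # The estimate for 0 exceeds 1/2 exactly when zeros strictly outnumber ones.
    return 0 if zeros > ones else 1
```

For instance, on the alternating history 0, 1, 0, 1, 0, 1 the first-order expert finds that
every earlier 1 was followed by a 0 and therefore predicts 0.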

The proposed prediction algorithm proceeds as follows: Let $m = 0, 1, 2, \ldots$ be a
nonnegative integer. For $2^m \le n < 2^{m+1}$, the prediction is based upon a weighted majority of
predictions of the experts $h^{(1)}, \ldots, h^{(2^{m+1})}$ as follows:

$$g_n(y_1^{n-1}, u) = \begin{cases} 0 & \text{if } u > \dfrac{\sum_{k=1}^{2^{m+1}} h_n^{(k)}(y_1^{n-1})\, w_n(k)}{\sum_{k=1}^{2^{m+1}} w_n(k)} \\ 1 & \text{otherwise,} \end{cases} \qquad n = 1, 2, \ldots,$$

where $w_n(k)$ is the weight of expert $h^{(k)}$ defined by the past performance of $h^{(k)}$ as

$$w_{2^m}(k) = 1 \quad \text{and} \quad w_n(k) = e^{-\eta_m (n - 2^m) L_{2^m}^{n-1}(h^{(k)})} \quad \text{for } 2^m < n < 2^{m+1},$$

where $\eta_m = \sqrt{8 \ln(2^{m+1}) / 2^m}$. Recall that

$$L_{2^m}^{n-1}(h^{(k)}) = \frac{1}{n - 2^m} \sum_{i=2^m}^{n-1} I_{\{h_i^{(k)}(y_1^{i-1}) \neq y_i\}}$$

is the average number of mistakes made by expert $h^{(k)}$ between times $2^m$ and $n - 1$. The
weight of each expert is therefore exponentially decreasing with the number of its mistakes
on this part of the data.
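
Putting the pieces together, one round-by-round reading of the algorithm is sketched below.
It is an unoptimized illustration under stated assumptions: the names `expert_predict` and
`universal_predict` are hypothetical, the uniform variables $U_n$ are drawn from Python's
`random.Random`, and every expert rescans the whole history at each step.

```python
import math
import random
from typing import List, Optional, Sequence


def expert_predict(past: Sequence[int], k: int) -> int:
    """k-th order empirical Markov expert: 0 iff zeros strictly outnumber ones."""
    n = len(past) + 1
    if n <= k + 1:
        return 1
    context = list(past[-k:])
    zeros = ones = 0
    for i in range(k + 1, n):                      # 1-based times k < i < n
        if list(past[i - 1 - k:i - 1]) == context:
            if past[i - 1] == 0:
                zeros += 1
            else:
                ones += 1
    return 0 if zeros > ones else 1


def universal_predict(bits: Sequence[int],
                      rng: Optional[random.Random] = None) -> List[int]:
    """Run the randomized weighted-majority predictor over a finite 0/1 list."""
    rng = rng if rng is not None else random.Random(0)
    preds: List[int] = []
    m = -1
    eta = 0.0
    mistakes: List[int] = []                       # per-expert errors in the block
    for n in range(1, len(bits) + 1):
        if n.bit_length() - 1 != m:                # n has reached a power of two
            m = n.bit_length() - 1                 # now 2^m <= n < 2^(m+1)
            num_experts = 2 ** (m + 1)
            eta = math.sqrt(8.0 * math.log(num_experts) / 2 ** m)
            mistakes = [0] * num_experts           # resets every weight to 1
        past = bits[:n - 1]
        votes = [expert_predict(past, k) for k in range(1, len(mistakes) + 1)]
        # (n - 2^m) * L_{2^m}^{n-1}(h^(k)) is the mistake count inside the block.
        weights = [math.exp(-eta * e) for e in mistakes]
        p_one = sum(w * v for w, v in zip(weights, votes)) / sum(weights)
        preds.append(0 if rng.random() > p_one else 1)
        for j, v in enumerate(votes):
            if v != bits[n - 1]:
                mistakes[j] += 1
    return preds
```

The sketch keeps only mistake counts per expert inside the current block, since the weight
depends on the history solely through that count. Its running time grows quickly with the
sequence length; it is meant to make the definition concrete, not to be efficient.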

Remarks. 1. The above-mentioned estimator of Morvai, Yakowitz, and Algoet [18] selects
a value of $k$ in a certain data-dependent manner, and uses the corresponding estimate $\widehat{P}_n^k$.
The new estimate, however, takes a mixture (weighted average) of all possible values of $k$,
with exponential weights depending on the past performance of each component estimator.
As Lemma 1 below suggests, this technique guarantees a number of errors almost as small
as that of the best expert (i.e., best value of $k$).

2. Ryabko [21] proposed an estimator somewhat similar in spirit to the predictor defined 
here. Ryabko used a mixture of empirical Markov predictors, and proved its universality for 
all stationary and ergodic processes in a sense related to the Kullback-Leibler divergence. 
The idea of diversifying Markov strategies also appears in Algoet [1]. 

3. Each time $n$ equals a power of two, all weights are reset to 1, and a simple majority vote
is taken among the experts. This is necessary to make the algorithm sequential and to be
able to incorporate more and more experts in the decision. If the total length of the sequence
to be predicted were finite (say $n$) and known in advance, then no such resetting would be
necessary; one could just use the first $n$ experts as Lemma 1 below describes. However, to
achieve universality, an infinite class of experts is necessary. As the first part of the proof of
Theorem 3 below shows, we do not lose much by such a resetting of the weights.

4. Related prediction schemes have been proposed by Feder, Merhav, and Gutman [10] 
for individual sequences. Their computationally quite simple methods are shown to predict 
asymptotically as well as any finite-state predictor. 

The main result of this section is the universality of this simple prediction scheme: 
Theorem 3 The prediction scheme g defined above is universal. 

In the proof we use a beautiful result of Cesa-Bianchi et al. [7]. It states that, given a set of $K$
experts, and a sequence of fixed length $n$, there exists a randomized predictor whose number
of mistakes is not more than that of the best predictor plus $\sqrt{(n/2) \ln K}$ for all possible
sequences $y_1^n$. The simpler algorithm and statement cited below is due to Cesa-Bianchi [6]:




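Lemma 1 Let $h^{(1)}, \ldots, h^{(K)}$ be a finite collection of prediction strategies (experts), and let
$\eta > 0$. Then if the prediction strategy $g$ is defined by

$$g_i(y_1^{i-1}, u) = \begin{cases} 0 & \text{if } u > \dfrac{\sum_{k=1}^{K} h_i^{(k)}(y_1^{i-1})\, w_i(k)}{\sum_{k=1}^{K} w_i(k)} \\ 1 & \text{otherwise,} \end{cases} \qquad i = 1, 2, \ldots,$$

where for all $k = 1, \ldots, K$,

$$w_1(k) = 1 \quad \text{and} \quad w_i(k) = e^{-\eta (i-1) L_1^{i-1}(h^{(k)})}, \quad i > 1,$$

then for every $n \ge 1$ and $y_1^n \in \{0,1\}^n$,

$$\widehat{L}_1^n(g) \le \min_{k=1,\ldots,K} L_1^n(h^{(k)}) + \frac{\ln K}{n\eta} + \frac{\eta}{8}.$$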



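In particular, if $N$ is a positive integer, and $\eta = \sqrt{8 \ln K / N}$, then

$$\widehat{L}_1^N(g) \le \min_{k=1,\ldots,K} L_1^N(h^{(k)}) + \sqrt{\frac{\ln K}{2N}}.$$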



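Proof of Theorem 3. Taking $K = 2^{m+1}$ and $N = 2^m$ in Lemma 1, we have that the
expected number of errors committed by $g$ on a segment $2^m, \ldots, 2^{m+1} - 1$ is bounded, for
any $y_1^{2^{m+1}-1} \in \{0,1\}^{2^{m+1}-1}$, as

$$\begin{aligned}
\widehat{L}_{2^m}^{2^{m+1}-1}(g)
&= \mathbf{E}\left[ \frac{1}{2^m} \sum_{i=2^m}^{2^{m+1}-1} I_{\{g_i(y_1^{i-1}, U_i) \neq y_i\}} \right] \\
&\le \min_{k=1,\ldots,2^{m+1}} L_{2^m}^{2^{m+1}-1}(h^{(k)}) + \sqrt{\frac{\ln(2^{m+1})}{2 \cdot 2^m}} \\
&= \min_{k=1,2,\ldots} L_{2^m}^{2^{m+1}-1}(h^{(k)}) + \sqrt{\frac{\ln(2^{m+1})}{2 \cdot 2^m}},
\end{aligned}$$

where the last equality follows from the fact that for all $i < 2^{m+1}$, all experts $h^{(k)}$ with
$k > 2^{m+1}$ predict identically to $h^{(2^{m+1})}$. (Note that since the predictors $h^{(k)}$ are deterministic,
for every $m$, $\widehat{L}_{2^m}^{2^{m+1}-1}(h^{(k)}) = L_{2^m}^{2^{m+1}-1}(h^{(k)})$.)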
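
The square-root term in this bound can be checked by direct substitution. The short
computation below assumes that the two additive terms supplied by Lemma 1 on a segment of
length $N$ are $\ln K/(N\eta)$ and $\eta/8$, as in the standard exponential-weighting bound,
and inserts the tuning $\eta = \sqrt{8 \ln K / N}$:

```latex
\frac{\ln K}{N\eta} + \frac{\eta}{8}
  = \sqrt{\frac{\ln K}{8N}} + \sqrt{\frac{\ln K}{8N}}
  = \sqrt{\frac{\ln K}{2N}},
\qquad \eta = \sqrt{\frac{8 \ln K}{N}} .
% With K = 2^{m+1} and N = 2^m this equals
\sqrt{\frac{\ln(2^{m+1})}{2 \cdot 2^m}} .
```

Both terms are equal at this value of $\eta$, which is the usual sign that the step size
balances the two sources of regret.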

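Similarly, denoting $\bar{n} = 2^{\lfloor \log_2 n \rfloor + 1}$, and invoking Lemma 1 with $K = \bar{n}$ and $N = \bar{n}/2$,

$$(n - \bar{n}/2 + 1)\, \widehat{L}_{\bar{n}/2}^{n}(g) \le (n - \bar{n}/2 + 1) \min_{k=1,2,\ldots} L_{\bar{n}/2}^{n}(h^{(k)}) + \sqrt{\frac{(\bar{n}/2) \ln \bar{n}}{2}} .$$

Therefore, for any sequence $y_1, y_2, \ldots$,

$$\begin{aligned}
n \widehat{L}_1^n(g)
&= \sum_{m=0}^{\lfloor \log_2 n \rfloor - 1} 2^m\, \widehat{L}_{2^m}^{2^{m+1}-1}(g) + (n - \bar{n}/2 + 1)\, \widehat{L}_{\bar{n}/2}^{n}(g) \\
&\le \sum_{m=0}^{\lfloor \log_2 n \rfloor - 1} 2^m \left( \min_{k=1,2,\ldots} L_{2^m}^{2^{m+1}-1}(h^{(k)}) + \sqrt{\frac{\ln(2^{m+1})}{2 \cdot 2^m}} \right) \\
&\quad + (n - \bar{n}/2 + 1) \min_{k=1,2,\ldots} L_{\bar{n}/2}^{n}(h^{(k)}) + \sqrt{\frac{(\bar{n}/2) \ln \bar{n}}{2}} \\
&\le n \min_{k=1,2,\ldots} L_1^n(h^{(k)}) + \sum_{m=0}^{\lfloor \log_2 n \rfloor - 1} \sqrt{\frac{2^m \ln(2^{m+1})}{2}} + \sqrt{\frac{(\bar{n}/2) \ln \bar{n}}{2}} \\
&= n \min_{k=1,2,\ldots} L_1^n(h^{(k)}) + \sum_{m=0}^{\lfloor \log_2 n \rfloor} \sqrt{\frac{2^m \ln(2^{m+1})}{2}} .
\end{aligned}$$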
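Denoting $\mu = \lfloor \log_2 n \rfloor$, we may write

$$\sum_{m=0}^{\mu} \sqrt{\frac{2^m \ln(2^{m+1})}{2}}
\le \sqrt{\frac{\ln(2^{\mu+1})}{2}} \sum_{m=0}^{\mu} 2^{m/2}
\le \sqrt{\frac{\ln(2^{\mu+1})}{2}} \cdot \frac{2^{(\mu+1)/2}}{\sqrt{2} - 1}
\le c \sqrt{n (\log_2 n + 1)},$$

where

$$c = \frac{\sqrt{\ln 2}}{\sqrt{2} - 1} \approx 2.01 .$$

Thus, we obtain

$$\widehat{L}_1^n(g) \le \min_{k=1,2,\ldots} L_1^n(h^{(k)}) + c \sqrt{\frac{\log_2 n + 1}{n}} .$$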
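
The value of the constant and the summation bound can be checked numerically. The sketch
below assumes that the quantity being bounded is the sum of
$\sqrt{2^m \ln(2^{m+1})/2}$ over $m = 0, \ldots, \lfloor \log_2 n \rfloor$ and that
$c = \sqrt{\ln 2}/(\sqrt{2} - 1)$; the helper names `regret_sum` and `bound` are hypothetical.

```python
import math


def regret_sum(n: int) -> float:
    """Sum over m = 0..floor(log2 n) of sqrt(2^m * ln(2^(m+1)) / 2), for n >= 1."""
    mu = n.bit_length() - 1                       # floor(log2 n) without rounding
    return sum(math.sqrt(2 ** m * (m + 1) * math.log(2) / 2)
               for m in range(mu + 1))


# Candidate value of the constant c: sqrt(ln 2) / (sqrt(2) - 1).
C = math.sqrt(math.log(2)) / (math.sqrt(2) - 1)


def bound(n: int) -> float:
    """Right-hand side c * sqrt(n * (log2 n + 1))."""
    return C * math.sqrt(n * (math.log2(n) + 1))
```

Dividing both sides by $n$ turns the comparison `regret_sum(n) <= bound(n)` into the
additive term $c\sqrt{(\log_2 n + 1)/n}$ of the loss bound.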



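Noting that for any fixed sequence $y_1^n$, $L_1^n(g, U_1^n)$ is a normalized sum of independent
$[0,1]$-valued random variables whose expectation is $\widehat{L}_1^n(g)$, we may use Hoeffding's
inequality [11] to see that for any sequence $y_1^n$, and $\epsilon > 0$,

$$\mathbf{P}\left\{ \left| L_1^n(g, U_1^n) - \widehat{L}_1^n(g) \right| > \epsilon \right\} \le 2 e^{-2n\epsilon^2}. \tag{4}$$

Therefore, if $L$ is now evaluated on the random sequence $Y_1, Y_2, \ldots$, we obtain

$$\begin{aligned}
\limsup_{n \to \infty} L_1^n(g, U_1^n)
&\le \limsup_{n \to \infty} \left( \min_{k=1,2,\ldots} L_1^n(h^{(k)}) + c \sqrt{\frac{\log_2 n + 1}{n}} \right) \\
&= \limsup_{n \to \infty} \min_{k=1,2,\ldots} L_1^n(h^{(k)}) \quad \text{almost surely.}
\end{aligned}$$

Thus, it remains to show that for any ergodic process $Y_1, Y_2, \ldots$,

$$\limsup_{n \to \infty} \min_{k=1,2,\ldots} L_1^n(h^{(k)}) \le L^* \quad \text{almost surely.} \tag{5}$$

This will follow easily from the following lemma:

Lemma 2 For any $k \ge 1$,

$$\limsup_{n \to \infty} L_1^n(h^{(k)}) \le L^* + \epsilon_k \quad \text{almost surely,}$$

where $\epsilon_k \ge 0$ is such that $\lim_{k \to \infty} \epsilon_k = 0$.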
Remark. If the process $\{Y_n\}$ happens to be $m$-th order Markov, then it is easy to see that
$\epsilon_k = 0$ for all $k \ge m$. The performance of the predictor for such processes is investigated in
the next section.

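Proof. Introduce

$$\widetilde{L}_1^n(h^{(k)}) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{P}\{Y_i \neq h_i^{(k)}(Y_1^{i-1}) \mid Y_{-\infty}^{i-1}\} .$$

By Lemma 5 in the Appendix we immediately obtain

$$\lim_{n \to \infty} \left| L_1^n(h^{(k)}) - \widetilde{L}_1^n(h^{(k)}) \right| = 0 \quad \text{almost surely.}$$

Therefore, it suffices to show that $\limsup_{n \to \infty} \widetilde{L}_1^n(h^{(k)}) \le L^* + \epsilon_k$ almost surely. To this end,
first we study the asymptotic behavior of the quantity $\mathbf{P}\{Y_0 \neq h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\}$. Notice
that

$$\begin{aligned}
\mathbf{P}\{Y_0 \neq h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\}
&\le I_A\, \mathbf{P}\{Y_0 \neq h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\} \\
&\quad + I_{B_k}\, \mathbf{P}\{Y_0 \neq h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\} \\
&\quad + I_{C_k}\, \mathbf{P}\{Y_0 \neq h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\} \\
&\quad + I_{D_k^c}\, \mathbf{P}\{Y_0 \neq h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\}
\end{aligned} \tag{6}$$

where

$$\begin{aligned}
A &= \left\{ \mathbf{P}\{Y_0 = 1 \mid Y_{-\infty}^{-1}\} = \tfrac{1}{2} \right\}, \\
B_k &= \left\{ \mathbf{P}\{Y_0 = 1 \mid Y_{-\infty}^{-1}\} < \tfrac{1}{2} \text{ and } \mathbf{P}\{Y_0 = 1 \mid Y_{-k}^{-1}\} < \tfrac{1}{2} \right\}, \\
C_k &= \left\{ \mathbf{P}\{Y_0 = 0 \mid Y_{-\infty}^{-1}\} < \tfrac{1}{2} \text{ and } \mathbf{P}\{Y_0 = 0 \mid Y_{-k}^{-1}\} < \tfrac{1}{2} \right\},
\end{aligned}$$

and $D_k = A \cup B_k \cup C_k$. Notice that

$$\begin{aligned}
\mathbf{P}\{Y_0 \neq h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\}
&= \mathbf{P}\{Y_0 = 0 \mid Y_{-\infty}^{-1}\}\, I_{\{\widehat{P}_n^k(0, Y_{-n+1}^{-1}, Y_{-k}^{-1}) \le 1/2\}} \\
&\quad + \mathbf{P}\{Y_0 = 1 \mid Y_{-\infty}^{-1}\}\, I_{\{\widehat{P}_n^k(0, Y_{-n+1}^{-1}, Y_{-k}^{-1}) > 1/2\}} .
\end{aligned} \tag{7}$$

Now we examine the four terms on the right-hand side of (6). For the first term (7) yields

$$I_A\, \mathbf{P}\{Y_0 \neq h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\} = I_A\, \tfrac{1}{2}
= I_A \min\left( \mathbf{P}\{Y_0 = 1 \mid Y_{-\infty}^{-1}\},\ \mathbf{P}\{Y_0 = 0 \mid Y_{-\infty}^{-1}\} \right).$$

For the second term observe that under $B_k$, for sufficiently large $n$,

$$\widehat{P}_n^k(1, Y_{-n+1}^{-1}, Y_{-k}^{-1}) < \tfrac{1}{2} \quad \text{almost surely,}$$

and therefore by (7) we have

$$\lim_{n \to \infty} I_{B_k}\, \mathbf{P}\{Y_0 \neq h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\}
= I_{B_k} \min\left( \mathbf{P}\{Y_0 = 1 \mid Y_{-\infty}^{-1}\},\ \mathbf{P}\{Y_0 = 0 \mid Y_{-\infty}^{-1}\} \right) \quad \text{a.s.}$$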
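For the third term we obtain similarly

$$\lim_{n \to \infty} I_{C_k}\, \mathbf{P}\{Y_0 \neq h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\}
= I_{C_k} \min\left( \mathbf{P}\{Y_0 = 1 \mid Y_{-\infty}^{-1}\},\ \mathbf{P}\{Y_0 = 0 \mid Y_{-\infty}^{-1}\} \right) \quad \text{a.s.}$$

The last term is simply bounded by $I_{D_k^c}$.

Combining all these bounds, we obtain

$$\mathbf{P}\{Y_0 \neq h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\}
\le I_{D_k}\, \mathbf{P}\{Y_0 \neq h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\} + I_{D_k^c} \tag{8}$$

and

$$\lim_{n \to \infty} I_{D_k}\, \mathbf{P}\{Y_0 \neq h_n^{(k)}(Y_{-n+1}^{-1}) \mid Y_{-\infty}^{-1}\}
= I_{D_k} \min\left( \mathbf{P}\{Y_0 = 1 \mid Y_{-\infty}^{-1}\},\ \mathbf{P}\{Y_0 = 0 \mid Y_{-\infty}^{-1}\} \right) \quad \text{a.s.} \tag{9}$$

From (8) it is immediate that

$$\limsup_{n \to \infty} \widetilde{L}_1^n(h^{(k)})
\le \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I_{D_k}(T^i Y_{-\infty}^{\infty})\, \mathbf{P}\{Y_i \neq h_i^{(k)}(Y_1^{i-1}) \mid Y_{-\infty}^{i-1}\}
+ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I_{D_k^c}(T^i Y_{-\infty}^{\infty}),$$

where $T$ denotes the left shift operator defined on doubly infinite binary sequences
$y_{-\infty}^{\infty} \in \{0,1\}_{-\infty}^{\infty}$. By this inequality and (9), Breiman's generalized ergodic theorem
(see Lemma 4 in the Appendix) implies

$$\begin{aligned}
\limsup_{n \to \infty} \widetilde{L}_1^n(h^{(k)})
&\le \mathbf{E}\left[ \min\left( \mathbf{P}\{Y_0 = 1 \mid Y_{-\infty}^{-1}\},\ \mathbf{P}\{Y_0 = 0 \mid Y_{-\infty}^{-1}\} \right) \right] + \mathbf{P}\{D_k^c\} \\
&= L^* + \mathbf{P}\{D_k^c\} \quad \text{almost surely.}
\end{aligned}$$

Since by the martingale convergence theorem

$$\lim_{k \to \infty} \mathbf{P}\{Y_0 = 1 \mid Y_{-k}^{-1}\} = \mathbf{P}\{Y_0 = 1 \mid Y_{-\infty}^{-1}\} \quad \text{almost surely,}$$

we have

$$\lim_{k \to \infty} \mathbf{P}\{D_k^c\} = 0 .$$

Taking $\epsilon_k = \mathbf{P}\{D_k^c\}$, the proof of the lemma is complete. □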
Now we return to the proof of Theorem 3. By Lemma 2, for arbitrary $K$,

$$\limsup_{n \to \infty} \min_{k=1,2,\ldots} L_1^n(h^{(k)}) \le \limsup_{n \to \infty} L_1^n(h^{(K)}) \le L^* + \epsilon_K .$$

Since $K$ is arbitrary and $\epsilon_K \to 0$, (5) is established, and the proof of the theorem is finished.
□

Remarks. 1. The proposed estimate is clearly easy to compute. One merely has to keep
track of the expected cumulative losses $L_{2^m}^{n-1}(h^{(k)})$ for $k = 1, 2, \ldots, n$. However, for large $n$,
storing the entire data history may be problematic. In such cases, more efficient tree-based
data structures, such as the ones described by Feder, Merhav, and Gutman [10], may be
applied. We do not investigate this issue further here.

2. We see from the analysis that for any sequence $y_1, y_2, \ldots$ and for all $n$,

$$\widehat{L}_1^n(g) \le \min_{k=1,2,\ldots} L_1^n(h^{(k)}) + 3 \sqrt{\frac{\log_2 n + 1}{n}},$$

and that the difference $|L_1^n(g, U_1^n) - \widehat{L}_1^n(g)|$ between the actual loss and the expected loss is
$O_p(n^{-1/2})$. (For a sequence of random variables $\{X_n\}$ and sequence of nonnegative numbers
$\{a_n\}$ we say that $X_n = O_p(a_n)$ if for every $\epsilon > 0$ there exists a constant $c > 0$ such that
$\limsup_{n \to \infty} \mathbf{P}\{|X_n| > c a_n\} < \epsilon$.) The rate of convergence to $L^*$ depends on the behavior
of the best expert for the time segment up to $n$. For example, in the next section we show
that for $m$-th order Markov processes the $m$-th expert predicts very well, and this fact will
suffice to derive performance bounds for the proposed predictor.

3. The proposed predictor is by no means the only possibility. Different sets of experts 
may be combined in a similar fashion, and universality only depends on the behavior of the 
best expert. If some additional information is known (or suspected) about the process to 
be predicted, this information may be built into the definition of the experts. We chose the
empirical Markov strategies as experts for convenience, and as we will see in the next section,
this choice pays off whenever the process happens to be finite order Markov.
3 Markov processes



In this section we assume that the process to predict $\{Y_n\}_{n=-\infty}^{\infty}$ is (in addition to being
stationary and ergodic) $m$-th order Markov, that is, for any binary sequence
$y_{-\infty}^{-1} = (\ldots, y_{-2}, y_{-1})$,

$$\mathbf{P}\{Y_0 = 1 \mid Y_{-\infty}^{-1} = y_{-\infty}^{-1}\} = \mathbf{P}\{Y_0 = 1 \mid Y_{-m}^{-1} = y_{-m}^{-1}\},$$

where $m$ is a positive integer. We show that the proposed predictor achieves a nearly optimal
performance for any $m$ and for any such process, even though the predictor does not use the
knowledge that the process is $m$-th order Markov. The intuitive reason for such a behavior
is the following: we have seen in the previous section that for any sequence,

$$L_1^n(g, U_1^n) \le \min_{k=1,2,\ldots} L_1^n(h^{(k)}) + 3 \sqrt{\frac{\log_2 n + 1}{n}} + O_p\left( \frac{1}{\sqrt{n}} \right).$$

On the other hand, if the sequence is $m$-th order Markov, then there exists an expert, namely
$h^{(m)}$, with very good performance.

In order to simplify our analysis, we modify the experts somewhat. They are defined as
before but the probability estimates of (3) are now replaced by

$$\widehat{P}_n^k(y, y_1^{n-1}, s) = \frac{\left| \{k < i < n : y_{i-k}^{i-1} = s,\ y_i = y\} \right| + 1}{\left| \{k < i < n : y_{i-k}^{i-1} = s\} \right| + 2} . \tag{10}$$

In other words, the simple empirical frequency counts are now replaced by the corresponding
Laplace estimates. It is easy to see that all results of Section 2 remain valid for the modified
predictor.
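
The add-one smoothing can be made concrete with a short sketch. The function name
`laplace_estimate`, the 0-indexed list for $y_1^{n-1}$, and the use of exact fractions are
assumptions made here for illustration. Note that when the context $s$ has not yet been
seen, the formula itself returns $1/2$, so no special case is needed.

```python
from fractions import Fraction
from typing import Sequence


def laplace_estimate(y: int, past: Sequence[int], s: Sequence[int]) -> Fraction:
    """Laplace (add-one) estimate of the probability that bit y follows context s.

    `past` holds y_1, ..., y_{n-1}; the context length k is len(s).
    """
    k = len(s)
    n = len(past) + 1
    matches = hits = 0
    for i in range(k + 1, n):                     # 1-based times k < i < n
        if list(past[i - 1 - k:i - 1]) == list(s):
            matches += 1                          # context s occurred before time i
            if past[i - 1] == y:
                hits += 1                         # ... and was followed by y
    return Fraction(hits + 1, matches + 2)
```

Unlike the raw frequency, this estimate is never 0 or 1, which keeps the logarithmic
quantities that appear in the divergence-based analysis finite.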

Remark. The reason for this modification is that this way we can appeal to a result of 
Rissanen [20] which simplifies our analysis. We believe that similar performance bounds are 
true for the original predictor of Section 2. 

In the next theorem we compare the performance of our predictor to the universal lower 
bound L*. The statement only gives information about the expected loss, but we believe this 
result already illustrates the good behavior of the proposed predictor for Markov processes. 

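Theorem 4 If the process to be predicted is a stationary and ergodic $m$-th order Markov
process, then the cumulative loss $L_1^n(g) = L_1^n(g, U_1^n)$ of the prediction strategy of Section 2
(with the modified estimates of (10)) satisfies

$$\mathbf{E} L_1^n(g) \le L^* + 2 \sqrt{\frac{2^{m-1} \log n}{n} + \frac{c}{n}} + 3 \sqrt{\frac{\log_2 n + 1}{n}} + \sqrt{\frac{\ln(2e)}{2n}},$$

where $c > 0$ is a universal constant.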






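Proof. First note that (4) implies

$$\mathbf{E}\left[ \left| L_1^n(g, U_1^n) - \widehat{L}_1^n(g) \right| \,\Big|\, Y_{-\infty}^{\infty} \right]
\le \sqrt{\int_0^{\infty} \min\left( 1, 2 e^{-2n\epsilon} \right) d\epsilon} = \sqrt{\frac{\ln(2e)}{2n}}$$

(see, e.g., [9, page 208]), and therefore it suffices to investigate $\widehat{L}_1^n(g)$. Recall also from the
proof of Theorem 3 that for any input sequence,

$$\widehat{L}_1^n(g) \le \min_{k=1,2,\ldots} L_1^n(h^{(k)}) + 3 \sqrt{\frac{\log_2 n + 1}{n}},$$

and, in particular,

$$\widehat{L}_1^n(g) \le L_1^n(h^{(m)}) + 3 \sqrt{\frac{\log_2 n + 1}{n}} .$$

Thus, it suffices to show that for $m$-th order Markov processes the performance of the $m$-th
expert satisfies

$$\mathbf{E} L_1^n(h^{(m)}) \le L^* + 2 \sqrt{\frac{2^{m-1} \log n}{n} + \frac{c}{n}}$$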
for some constant c. To this end, observe that, on the one hand, 

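$$\mathbf{E} L_1^n(h^{(m)})
= \mathbf{E}\left[ \frac{1}{n} \sum_{i=1}^{n} I_{\{h_i^{(m)}(Y_1^{i-1}) \neq Y_i\}} \right]
= \mathbf{E}\left[ \frac{1}{n} \sum_{i=1}^{n} \mathbf{P}\{h_i^{(m)}(Y_1^{i-1}) \neq Y_i \mid Y_{-\infty}^{i-1}\} \right],$$

and on the other hand, by the Markov property,

$$\begin{aligned}
L^* &= \mathbf{E}\left[ \frac{1}{n} \sum_{i=1}^{n} \min\left( \mathbf{P}\{Y_i = 1 \mid Y_{i-m}^{i-1}\},\ \mathbf{P}\{Y_i = 0 \mid Y_{i-m}^{i-1}\} \right) \right] \\
&= \mathbf{E}\left[ \frac{1}{n} \sum_{i=1}^{n} \mathbf{P}\{h^{(m,*)}(Y_{i-m}^{i-1}) \neq Y_i \mid Y_{-\infty}^{i-1}\} \right],
\end{aligned}$$

where $h^{(m,*)}$ is the Bayes decision, given, for any $s \in \{0,1\}^m$, by

$$h^{(m,*)}(s) = \begin{cases} 1 & \text{if } \mathbf{P}\{Y_0 = 1 \mid Y_{-m}^{-1} = s\} > 1/2 \\ 0 & \text{otherwise.} \end{cases}$$

(Note that the optimal predictor, that is, the one which minimizes the probability of error
at every step, predicts according to $h^{(m,*)}$.)
The above equalities imply that

$$\begin{aligned}
\mathbf{E} L_1^n(h^{(m)}) - L^*
&\le \frac{1}{n} \sum_{i=1}^{n} \mathbf{E}\left| \mathbf{P}\{h_i^{(m)}(Y_1^{i-1}) \neq Y_i \mid Y_{-\infty}^{i-1}\} - \mathbf{P}\{h^{(m,*)}(Y_{i-m}^{i-1}) \neq Y_i \mid Y_{-\infty}^{i-1}\} \right| \\
&\le \frac{2}{n} \sum_{i=1}^{n} \mathbf{E}\left| \widehat{P}_i^m(1, Y_1^{i-1}, Y_{i-m}^{i-1}) - \mathbf{P}\{Y_i = 1 \mid Y_{i-m}^{i-1}\} \right| ,
\end{aligned}$$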
^ i 
where the second inequality follows by [9, Theorem 2.2]. In the rest of the proof we simply
apply some known results from the theory of universal prediction. First, by applications of
Jensen's and Pinsker's inequalities (see Merhav and Feder [15, eq. (20)]) we obtain

$$\frac{2}{n} \sum_{i=1}^{n} \mathbf{E}\left| \widehat{P}_i^m(1, Y_1^{i-1}, Y_{i-m}^{i-1}) - \mathbf{P}\{Y_i = 1 \mid Y_{i-m}^{i-1}\} \right|
\le 2 \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \mathbf{E}\left[ \sum_{y \in \{0,1\}} \mathbf{P}\{Y_i = y \mid Y_{i-m}^{i-1}\} \log \frac{\mathbf{P}\{Y_i = y \mid Y_{i-m}^{i-1}\}}{\widehat{P}_i^m(y, Y_1^{i-1}, Y_{i-m}^{i-1})} \right] } .$$

Observe that on the right-hand side, under the square root sign, we have the normalized
Kullback-Leibler divergence between the probability measure of $Y_1^n$ and its estimate
constructed as a product of the Laplace estimates (10). But this divergence, for $m$-th order
Markov sources, is well known to be bounded by

$$\frac{2^m}{2n} \log n + O\left( \frac{1}{n} \right),$$

see Rissanen [20]. This concludes the proof. □

Remarks. 1. As Theorem 4 shows, by exponential weighting of the empirical Markov
strategies, the predictor automatically adapts to the unknown Markov order. Similar
results, though in a different setup, are achieved by Modha and Masry [13], [14] by complexity
regularization.

2. Merhav, Feder, and Gutman [16] showed that if the process is $m$-th order Markov, then
the randomized predictor defined by

$$\bar{h}_i^{(m)}(y_1^{i-1}, u) = \begin{cases}
0 & \text{if } \widehat{P}_i^m(0, y_1^{i-1}, y_{i-m}^{i-1}) > \frac{1}{2} \\
1 & \text{if } \widehat{P}_i^m(0, y_1^{i-1}, y_{i-m}^{i-1}) < \frac{1}{2} \\
I_{\{u > 1/2\}} & \text{otherwise}
\end{cases}$$

achieves $\mathbf{E} L_1^n(\bar{h}^{(m)}) - L^* \le C/n$, where $C$ is a constant depending on the distribution of the
process. However, in an interesting contrast, the best distribution-free upper bound for all
$m$-th order Markov processes is of the order of $n^{-1/2}$. To illustrate this, consider the case
$m = 0$, that is, when $\{Y_n\}$ is an i.i.d. process with $\mathbf{P}\{Y_1 = 1\} = 1/2 + \theta$, and the predictor
is based on a majority vote of the bits that appeared in the past. In this case, for every $n$,

$$\sup_{\theta \in [-1/2, 1/2]} \left( \mathbf{E} L_1^n(\bar{h}^{(0)}) - L^* \right) \ge c_1 n^{-1/2},$$

where $c_1$ is a universal constant. (This is straightforward to see by considering $\theta = c n^{-1/2}$
for some small constant $c$, and writing

$$\begin{aligned}
\mathbf{E} L_1^n(\bar{h}^{(0)}) - L^*
&= \frac{1}{n} \sum_{i=1}^{n} \left[ \mathbf{P}\{\bar{h}_i^{(0)}(Y_1^{i-1}, U_i) \neq Y_i\} - \left( \frac{1}{2} - \theta \right) \right] \\
&= \frac{1}{n} \sum_{i=1}^{n} 2\theta\, \mathbf{P}\{\bar{h}_i^{(0)}(Y_1^{i-1}, U_i) = 0\} \\
&\ge \frac{1}{n} \sum_{i=1}^{n} 2\theta\, \mathbf{P}\left\{ \sum_{j=1}^{i-1} (Y_j - \mathbf{E} Y_j) < -(i-1)\theta \right\} .
\end{aligned}$$

Finally, invoke the Berry-Esseen theorem (see, e.g., [8]) to deduce that there exists a universal
constant $c_2$ such that $\mathbf{P}\left\{ \sum_{j=1}^{i-1} (Y_j - \mathbf{E} Y_j) < -(i-1)\theta \right\} \ge c_2$ for every $2 \le i \le n$.) Thus,
even though for every single value of $\theta$, $\mathbf{E} L_1^n(\bar{h}^{(m)}) - L^*$ converges to zero at a rate of $O(1/n)$,
the minimax rate of convergence is, in fact, $O(1/\sqrt{n})$. Since the upper bound in Theorem 4
is independent of the distribution, we see that, in this sense, (ignoring logarithmic factors)
the order of magnitude of the bound is the best possible.
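
The contrast between the two rates can be observed numerically in the case $m = 0$. The
sketch below computes the excess expected loss of the randomized majority vote exactly, by
tracking the binomial distribution of the number of ones seen so far. It assumes
$0 \le \theta \le 1/2$, so that $L^* = 1/2 - \theta$ and the excess at time $i$ equals
$2\theta$ times the probability of predicting 0; the function name `excess_loss_majority`
is hypothetical.

```python
import math


def excess_loss_majority(n: int, theta: float) -> float:
    """Exact E L_1^n - L* of the randomized majority vote for i.i.d. bits.

    Bits are 1 with probability 1/2 + theta (0 <= theta <= 1/2); ties in the
    vote, including the empty past, are broken by a fair coin.
    """
    p = 0.5 + theta
    dist = [1.0]                       # dist[j] = P{j ones among the past bits}
    total = 0.0
    for i in range(1, n + 1):
        t = i - 1                      # number of past bits at time i
        # P{predict 0} = P{ones < t/2} + (1/2) P{ones = t/2}
        p_zero = sum(pr for j, pr in enumerate(dist) if 2 * j < t)
        if t % 2 == 0:
            p_zero += 0.5 * dist[t // 2]
        total += 2.0 * theta * p_zero
        # Append one more bit to the past: convolve with Bernoulli(p).
        new = [0.0] * (t + 2)
        for j, pr in enumerate(dist):
            new[j] += pr * (1.0 - p)
            new[j + 1] += pr * p
        dist = new
    return total / n
```

For a fixed bias such as `theta = 0.25`, the product `n * excess_loss_majority(n, 0.25)`
stays bounded as `n` grows, while along the sequence `theta = 0.5 / math.sqrt(n)` the
product `math.sqrt(n) * excess_loss_majority(n, theta)` stays bounded away from zero.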
4 Prediction with side information



In this section we apply the same ideas to the seemingly more difficult classification (or
pattern recognition) problem. The setup is the following: let $\{(X_n, Y_n)\}_{n=-\infty}^{\infty}$ be a stationary
and ergodic sequence of pairs taking values in $\mathcal{R}^d \times \{0,1\}$. The problem is to predict the
value of $Y_n$ given the data $(X_n, \mathcal{D}_1^{n-1})$, where we denote $\mathcal{D}_1^{n-1} = (X_1^{n-1}, Y_1^{n-1})$. The prediction
problem is similar to the one studied in Section 2 with the exception that the sequence of
$X_i$'s is also available to the predictor. One may think about the $X_i$'s as side information.

We may formalize the prediction problem as follows. A (randomized) prediction strategy
is a sequence $g = \{g_i\}_{i=1}^{\infty}$ of decision functions

$$g_i : \{0,1\}^{i-1} \times \left( \mathcal{R}^d \right)^i \times [0,1] \to \{0,1\}$$

so that the prediction formed at time $i$ is $g_i(y_1^{i-1}, x_1^i, U_i)$. The normalized cumulative loss for
any fixed pair of sequences $x_1^n$, $y_1^n$ is now

$$R_1^n(g, U_1^n) = \frac{1}{n} \sum_{i=1}^{n} I_{\{g_i(y_1^{i-1}, x_1^i, U_i) \neq y_i\}} .$$

We also use the short notation $R_1^n(g) = R_1^n(g, U_1^n)$. Denote the expected loss of the
randomized strategy $g$ by

$$\widehat{R}_1^n(g) = \mathbf{E} R_1^n(g, U_1^n).$$

We assume that the randomizing variables $U_1, U_2, \ldots$ are independent of the process $\{(X_n, Y_n)\}$.

Just like in the case of prediction without side information, the fundamental limit is given 
by the Bayes probability of error: 

Theorem 5 For any prediction strategy $g$ and stationary ergodic process $\{(X_n, Y_n)\}_{n=-\infty}^{\infty}$,

$$\liminf_{n \to \infty} R_1^n(g) \ge R^* \quad \text{almost surely,}$$

where

$$R^* = \mathbf{E}\left[ \min\left( \mathbf{P}\{Y_0 = 1 \mid Y_{-\infty}^{-1}, X_{-\infty}^{0}\},\ \mathbf{P}\{Y_0 = 0 \mid Y_{-\infty}^{-1}, X_{-\infty}^{0}\} \right) \right].$$

The proof of this lower bound is similar to that of Theorem 1; the details are omitted.
It follows from results of Morvai, Yakowitz, and Gyorfi [17] that there exists a prediction
strategy $g$ such that for all ergodic processes, $R_1^n(g) \to R^*$ almost surely. (We omit the
details here.) The algorithm of Morvai, Yakowitz, and Gyorfi, however, has a very slow 
rate of convergence even for i.i.d. processes. The main message of this section is a simple 
universal procedure with a practical appeal. The idea, again, is to combine the decisions of 
a small number of simple experts in an appropriate way. 

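We define an infinite array of experts $h^{(k,\ell)}$, $k, \ell = 1, 2, \ldots$ as follows. Let
$\mathcal{P}_\ell = \{A_{\ell,j},\ j = 1, 2, \ldots, m_\ell\}$ be a sequence of finite partitions of the feature space
$\mathcal{R}^d$, and let $G_\ell$ be the corresponding quantizer:

$$G_\ell(x) = j, \quad \text{if } x \in A_{\ell,j} .$$

With some abuse of notation, for any $n$ and $x_1^n \in (\mathcal{R}^d)^n$, we write $G_\ell(x_1^n)$ for the sequence
$G_\ell(x_1), \ldots, G_\ell(x_n)$. Fix positive integers $k$, $\ell$, and for each $s \in \{0,1\}^k$,
$z \in \{1, 2, \ldots, m_\ell\}^{k+1}$, and $y \in \{0,1\}$ define

$$\widehat{P}_n^{(k,\ell)}(y, y_1^{n-1}, x_1^n, s, z) = \frac{\left| \{k < i < n : y_{i-k}^{i-1} = s,\ G_\ell(x_{i-k}^{i}) = z,\ y_i = y\} \right|}{\left| \{k < i < n : y_{i-k}^{i-1} = s,\ G_\ell(x_{i-k}^{i}) = z\} \right|}, \qquad n > k+1. \tag{11}$$

$0/0$ is defined to be $1/2$. Also, for $n \le k+1$ we define $\widehat{P}_n^{(k,\ell)}(y, y_1^{n-1}, x_1^n, s, z) = 1/2$.
The expert $h^{(k,\ell)}$ is now defined by

$$h_n^{(k,\ell)}(y_1^{n-1}, x_1^n) = \begin{cases} 0 & \text{if } \widehat{P}_n^{(k,\ell)}(0, y_1^{n-1}, x_1^n, y_{n-k}^{n-1}, G_\ell(x_{n-k}^{n})) > \frac{1}{2} \\ 1 & \text{otherwise,} \end{cases} \qquad n = 1, 2, \ldots$$

That is, expert $h^{(k,\ell)}$ quantizes the sequence $x_1^n$ according to the partition $\mathcal{P}_\ell$, and looks for
all appearances of the last seen quantized strings $y_{n-k}^{n-1}$, $G_\ell(x_{n-k}^{n})$ of length $k$ in the past.
Then it predicts according to the larger of the relative frequencies of 0's and 1's following
the string.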
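
A minimal sketch of one such expert follows. It is an illustration under stated
assumptions, not the authors' code: the function name `side_info_expert_predict` is
hypothetical, the quantizer $G_\ell$ is passed in as an arbitrary Python callable, and the
1-indexed sequences are stored in 0-indexed lists.

```python
from typing import Callable, Hashable, Sequence, TypeVar

X = TypeVar("X")


def side_info_expert_predict(past_y: Sequence[int], xs: Sequence[X], k: int,
                             quantize: Callable[[X], Hashable]) -> int:
    """Prediction of y_n from y_1..y_{n-1} and x_1..x_n by matching quantized contexts."""
    n = len(past_y) + 1
    if len(xs) != n:
        raise ValueError("xs must contain x_1, ..., x_n")
    if n <= k + 1:
        return 1                                   # estimate is 1/2, rule yields 1
    q = [quantize(x) for x in xs]                  # quantized side information
    ctx_y = list(past_y[n - 1 - k:n - 1])          # y_{n-k}^{n-1}
    ctx_q = q[n - 1 - k:n]                         # quantized x_{n-k}^{n}, k+1 symbols
    zeros = ones = 0
    for i in range(k + 1, n):                      # 1-based times k < i < n
        same_y = list(past_y[i - 1 - k:i - 1]) == ctx_y
        same_q = q[i - 1 - k:i] == ctx_q
        if same_y and same_q:
            if past_y[i - 1] == 0:
                zeros += 1
            else:
                ones += 1
    return 0 if zeros > ones else 1
```

Passing a coarser or finer `quantize` callable corresponds to moving along the index
$\ell$, while `k` controls the length of the matched string.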

The proposed algorithm combines the predictions of these experts similarly to that of
Section 2. This way both the length of the string to be matched and the resolution of the
quantizer are adjusted depending on the data. The formal definition is as follows: For any
$m = 0, 1, 2, \ldots$, if $2^m \le n < 2^{m+1}$, the prediction is based upon a weighted majority of
predictions of the $(2^{m+1})^2$ experts $h^{(k,\ell)}$, $k, \ell \le 2^{m+1}$ as follows:

$$g_n(y_1^{n-1}, x_1^n, u) = \begin{cases} 0 & \text{if } u > \dfrac{\sum_{k,\ell \le 2^{m+1}} h_n^{(k,\ell)}(y_1^{n-1}, x_1^n)\, w_n(k,\ell)}{\sum_{k,\ell \le 2^{m+1}} w_n(k,\ell)} \\ 1 & \text{otherwise,} \end{cases}$$

where $w_n(k,\ell)$ is the weight of expert $h^{(k,\ell)}$ defined by the past performance of $h^{(k,\ell)}$ as

$$w_{2^m}(k,\ell) = 1 \quad \text{and} \quad w_n(k,\ell) = e^{-\eta_m (n - 2^m) R_{2^m}^{n-1}(h^{(k,\ell)})} \quad \text{for } 2^m < n < 2^{m+1},$$

where $\eta_m = \sqrt{8 \ln\left( (2^{m+1})^2 \right) / 2^m}$.

To prove the universality of the method, we need some natural conditions on the sequence 
of partitions. We assume the following: 

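(a) the sequence of partitions is nested, that is, any cell of $\mathcal{P}_{\ell+1}$ is a subset of a cell of $\mathcal{P}_\ell$,
$\ell = 1, 2, \ldots$;

(b) each partition $\mathcal{P}_\ell$ is finite;

(c) if $\mathrm{diam}(A) = \sup_{x,y \in A} \|x - y\|$ denotes the diameter of a set, then for each sphere $S$
centered at the origin

$$\lim_{\ell \to \infty} \max_{j : A_{\ell,j} \cap S \neq \emptyset} \mathrm{diam}(A_{\ell,j}) = 0 .$$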
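
One family of partitions meeting the three conditions, given here only as a hypothetical
example for $d = 1$, splits the interval $[-2^\ell, 2^\ell)$ into dyadic intervals of width
$2^{-\ell}$ and collects everything outside into a single unbounded cell. The names
`dyadic_cell` and `num_cells` are assumptions of this sketch.

```python
import math


def dyadic_cell(x: float, level: int) -> int:
    """Cell index of x in a hypothetical level-`level` partition of the real line.

    Cell 0 is the unbounded cell outside [-2**level, 2**level); the remaining
    cells are half-open intervals of width 2**-level, numbered left to right from 1.
    """
    radius = 2 ** level
    if not -radius <= x < radius:
        return 0
    return 1 + math.floor((x + radius) * 2 ** level)


def num_cells(level: int) -> int:
    """Number of cells at the given level: 2 * 4**level bounded cells plus one."""
    return 1 + 2 * 4 ** level
```

Every interval at level $\ell + 1$ lies inside a single cell at level $\ell$ (condition
(a)), each level has finitely many cells (condition (b)), and any bounded region is
eventually covered by intervals of width $2^{-\ell}$ (condition (c)).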

Remark. The next theorem states the universality of the proposed pattern recognition
scheme. The definition of the algorithm is somewhat arbitrary; we just chose one of the
many possibilities. In this version, at time $n$, only partitions with indices at most $n$ are taken
into account. It is easy to see that the universality property remains valid if the number of
partitions considered at time $n$ is an arbitrary, polynomially increasing function of $n$. The
conditions for the sequence of partitions again give a lot of liberty to the user. In applications,
the partitions may be chosen to incorporate some prior knowledge about the process. In this
paper we merely prove universality of the scheme. Performance bounds in the style of Section
3 for special types of processes may be derived, thanks to the powerful individual sequence
bounds. Here, however, the analysis may be substantially more complicated.

Theorem 6 Assume that the sequence of partitions $\mathcal{P}_\ell$ satisfies the three conditions above.
Then the pattern recognition scheme $g$ defined above satisfies

$$\lim_{n \to \infty} R_1^n(g) = R^* \quad \text{almost surely}$$

for any stationary and ergodic process $\{(X_n, Y_n)\}_{n=-\infty}^{\infty}$.



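Proof. As in the proof of Theorem 3, we obtain that for any stationary and ergodic process
$\{(X_n, Y_n)\}_{n=-\infty}^{\infty}$,

$$\begin{aligned}
\limsup_{n \to \infty} R_1^n(g, U_1^n)
&\le \limsup_{n \to \infty} \left( \min_{\substack{k = 1, 2, \ldots \\ \ell = 1, 2, \ldots, n-1}} R_1^n(h^{(k,\ell)}) + 2c \sqrt{\frac{\log_2 n + 1}{n}} \right) \\
&= \limsup_{n \to \infty} \min_{\substack{k = 1, 2, \ldots \\ \ell = 1, 2, \ldots, n-1}} R_1^n(h^{(k,\ell)}) \quad \text{almost surely.}
\end{aligned}$$

Thus, it remains to show that

$$\limsup_{n \to \infty} \min_{\substack{k = 1, 2, \ldots \\ \ell = 1, 2, \ldots, n-1}} R_1^n(h^{(k,\ell)}) \le R^* \quad \text{almost surely.}$$

To prove this, we use the following lemma, whose proof is easily obtained by copying that
of Lemma 2: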



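Lemma 3 For each $k, \ell \ge 1$, there exists a positive number $\epsilon_{k,\ell}$ such that for any fixed $\ell$,
$\lim_{k \to \infty} \epsilon_{k,\ell} = 0$ and

$$\limsup_{n \to \infty} R_1^n(h^{(k,\ell)}) \le R^{*(\ell)} + \epsilon_{k,\ell},$$

where

$$R^{*(\ell)} = \mathbf{E}\left[ \min\left( \mathbf{P}\{Y_0 = 1 \mid Y_{-\infty}^{-1}, G_\ell(X_{-\infty}^{0})\},\ \mathbf{P}\{Y_0 = 0 \mid Y_{-\infty}^{-1}, G_\ell(X_{-\infty}^{0})\} \right) \right].$$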



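Now we return to the proof of Theorem 6. Since the sequence of partitions $\mathcal{P}_\ell$ is nested, and
by (c), the sequences

$$\mathbf{P}\{Y_0 = 1 \mid Y_{-\infty}^{-1}, G_\ell(X_{-\infty}^{0})\} \quad \text{and} \quad \mathbf{P}\{Y_0 = 0 \mid Y_{-\infty}^{-1}, G_\ell(X_{-\infty}^{0})\}, \qquad \ell = 1, 2, \ldots$$

are martingales and they converge almost surely to

$$\mathbf{P}\{Y_0 = 1 \mid Y_{-\infty}^{-1}, X_{-\infty}^{0}\} \quad \text{and} \quad \mathbf{P}\{Y_0 = 0 \mid Y_{-\infty}^{-1}, X_{-\infty}^{0}\} .$$

Thus, it follows from Lebesgue's dominated convergence theorem that

$$\lim_{\ell \to \infty} R^{*(\ell)} = \mathbf{E}\left[ \min\left( \mathbf{P}\{Y_0 = 1 \mid Y_{-\infty}^{-1}, X_{-\infty}^{0}\},\ \mathbf{P}\{Y_0 = 0 \mid Y_{-\infty}^{-1}, X_{-\infty}^{0}\} \right) \right] = R^* .$$

Now it follows easily that

$$\limsup_{n \to \infty} \min_{\substack{k = 1, 2, \ldots \\ \ell = 1, 2, \ldots, n-1}} R_1^n(h^{(k,\ell)}) \le R^* \quad \text{almost surely,}$$

and the proof of the theorem is finished.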
5 Appendix



Here we describe two results which are used in the analysis. The first is due to Breiman [5], 
and its proof may also be found in Algoet [2]. 

Lemma 4 (Breiman's generalized ergodic theorem [5]). Let $Z = \{Z_i\}_{i=-\infty}^{\infty}$ be a
stationary and ergodic time series. Let $T$ denote the left shift operator. Let $f_i$ be a sequence of
real-valued functions such that for some function $f$, $f_i(Z) \to f(Z)$ almost surely. Assume
that $\mathbf{E} \sup_i |f_i(Z)| < \infty$. Then

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} f_i(T^i Z) = \mathbf{E} f(Z)$$

almost surely.

The second is the Hoeffding-Azuma inequality for sums of bounded martingale differences: 

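Lemma 5 (Hoeffding [11], Azuma [3]). Let $X_1, X_2, \ldots$ be a sequence of random
variables, and assume that $V_1, V_2, \ldots$ is a martingale difference sequence with respect to
$X_1, X_2, \ldots$. Assume furthermore that there exist random variables $Z_1, Z_2, \ldots$ and nonnegative
constants $c_1, c_2, \ldots$ such that for every $i > 0$, $Z_i$ is a function of $X_1, \ldots, X_{i-1}$, and

$$Z_i \le V_i \le Z_i + c_i \quad \text{with probability one.}$$

Then for any $\epsilon > 0$ and $n$,

$$\mathbf{P}\left\{ \sum_{i=1}^{n} V_i \ge \epsilon \right\} \le e^{-2\epsilon^2 / \sum_{i=1}^{n} c_i^2}$$

and

$$\mathbf{P}\left\{ \sum_{i=1}^{n} V_i \le -\epsilon \right\} \le e^{-2\epsilon^2 / \sum_{i=1}^{n} c_i^2} .$$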